The fundamental shift in high-performance computing is the move from a CPU-centric, serial execution model to a decoupled producer-consumer model, in which the CPU manages the pipeline while the GPU runs independently. The core insight is that a GPU is not meant to be driven as a strictly synchronous device; treating it as one creates a "stop-and-wait" bottleneck.
1. The Workflow Lifecycle
In the asynchronous mindset, the developer does not wait for each task to finish. Instead, they allocate memory, launch kernels, and copy results back by enqueuing non-blocking requests into a hardware queue.
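The lifecycle above can be sketched in HIP C++. This is a minimal illustration, not code from the original: all identifiers (scaleKernel, h_buf, d_buf, N) are placeholders, and pinned host memory is assumed so the copies can be truly asynchronous.

```cpp
#include <hip/hip_runtime.h>

// Trivial kernel used only to illustrate the enqueue step (illustrative name).
__global__ void scaleKernel(float* data, float factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t N = 1 << 20;
    float *h_buf, *d_buf;
    hipHostMalloc(&h_buf, N * sizeof(float));  // pinned host memory: needed for truly async copies
    hipMalloc(&d_buf, N * sizeof(float));      // allocate device memory

    hipStream_t stream;
    hipStreamCreate(&stream);

    // Each call below enqueues a non-blocking request and returns immediately:
    hipMemcpyAsync(d_buf, h_buf, N * sizeof(float), hipMemcpyHostToDevice, stream);
    scaleKernel<<<(N + 255) / 256, 256, 0, stream>>>(d_buf, 2.0f, N);
    hipMemcpyAsync(h_buf, d_buf, N * sizeof(float), hipMemcpyDeviceToHost, stream);

    // The CPU is free to do other work here while the GPU drains the queue.

    hipStreamSynchronize(stream);  // block only once, when the results are needed
    hipFree(d_buf);
    hipHostFree(h_buf);
    hipStreamDestroy(stream);
    return 0;
}
```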
2. Overcoming Stalls
When the host is forced to synchronize after every operation, the gaps in execution (the transfer time between CPU and GPU) become the dominant performance bottleneck. By exploiting asynchrony, the CPU can keep working while the GPU processes its stream in parallel, maximizing hardware utilization.
$$\text{Total time} = \max(\text{CPU work}, \text{GPU work}) + \text{synchronization overhead}$$
QUESTION 1
Which set of steps correctly converts a synchronous vector-add to use an explicit stream?
Call hipStreamCreate, use hipMemcpyAsync with the handle, and pass the handle as the 4th kernel launch-configuration argument.
Call hipDeviceSynchronize after every kernel launch and use hipMemcpy.
Set the stream parameter to NULL in all hipMemcpyAsync calls.
Replace hipMalloc with hipHostMalloc exclusively.
✅ Correct!
Correct. Explicit streams require handle creation, async memory operations, and passing the handle to the kernel launch configuration.
❌ Incorrect
Using hipMemcpy (blocking) or the NULL stream (implicitly synchronous) defeats the purpose of the mindset shift.
QUESTION 2
Why is a GPU considered 'not meant to be driven as a strictly synchronous device'?
Because it has no internal clock.
Because waiting for the CPU to confirm every command leaves thousands of cores idle.
Because memory transfers cannot be tracked by the CPU.
Because the GPU must manage its own power state.
✅ Correct!
GPU efficiency comes from high-throughput parallel work; synchronizing after every small step creates 'dead air' on the hardware.
❌ Incorrect
The issue is latency and core utilization, not hardware clocking or power management.
QUESTION 3
What is the primary risk of forcing the host to synchronize after every operation?
Memory corruption.
Host-side stalling and loss of hardware saturation.
Increased power consumption on the GPU.
Kernel compile errors.
✅ Correct!
Synchronous calls block the CPU, preventing it from preparing the next 'chunk' of work for the GPU.
❌ Incorrect
While inefficient, it doesn't corrupt memory or cause compilation errors.
QUESTION 4
In the logistics warehouse analogy, what does the 'Conveyor Belt' represent?
A HIP Stream.
The GPU Driver.
The CPU Cache.
The VRAM buffer.
✅ Correct!
A stream acts like a conveyor belt, allowing the CPU to load tasks sequentially without waiting for the worker (the GPU) to finish the current one.
❌ Incorrect
The stream is the FIFO queue mechanism that facilitates the non-blocking 'conveyor' flow.
QUESTION 5
True or False: hipMemcpyAsync returns control to the CPU before the data transfer is complete.
True
False
✅ Correct!
Yes! This is the definition of non-blocking: the CPU just enqueues the request and moves on.
❌ Incorrect
If it waited, it would be a standard synchronous hipMemcpy.
Case Study: The Warehouse Manager's Bottleneck
Asynchrony Implementation
A legacy ROCm application uses standard hipMemcpy and kernel launches without stream handles. The CPU utilization is 98%, but the GPU is only at 15% utilization because it waits for the CPU to finish logging data before starting the next copy.
Q
Explain how Asynchrony would fix this 'stop-and-wait' bottleneck.
Solution:
By using asynchrony, the CPU can enqueue the next data transfer and kernel launch to a HIP stream and immediately return to its logging tasks. This allows the GPU to process the stream in parallel with the CPU's logging, keeping the compute cores saturated.
Q
Provide the code required to create a stream and launch a kernel into it (replacing a default launch).
Solution:
hipStream_t myStream;
hipStreamCreate(&myStream);                    // create an explicit stream handle
myKernel<<<grid, block, 0, myStream>>>(args);  // 4th launch-config argument routes the kernel into the stream
Q
What function must be called to ensure the data is fully copied back to the host before the CPU accesses it?
Solution:
hipStreamSynchronize(myStream); must be called. This is the explicit 'handshake' that confirms all previous work in that specific stream is complete.
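Tying the case-study pieces together, here is a minimal sketch of that handshake in context; the function and buffer names (fetchResults, h_result, d_result, doHostLogging) are illustrative, not part of the original application.

```cpp
#include <hip/hip_runtime.h>

// Forward declaration of an illustrative host-side task (e.g., the logging
// work from the case study).
void doHostLogging();

// Copy results back asynchronously, overlap host work, then synchronize
// exactly once before the host touches the buffer.
void fetchResults(float* h_result, const float* d_result, size_t bytes,
                  hipStream_t myStream) {
    // Enqueue the device-to-host copy; control returns immediately.
    hipMemcpyAsync(h_result, d_result, bytes, hipMemcpyDeviceToHost, myStream);

    doHostLogging();  // CPU keeps working while the GPU drains the stream

    hipStreamSynchronize(myStream);  // the explicit handshake for this stream
    // h_result is now safe to read on the host.
}
```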